Abstract: We propose an approach for testing the hypothesis that two realizations of random
variables in the form of histograms are taken from the same statistical population (i.e. that two
histograms are drawn from the same distribution). The approach is based on the notion of
"significance of deviation". It also allows one to estimate the statistical difference between two
histograms.

1. Introduction

Testing the hypothesis that two histograms are drawn from the same distribution is an important
problem in many areas of scientific research. It arises, for example, when monitoring the
experimental equipment during an experiment. Several approaches to formalizing and resolving this
problem have been considered [?]. Recently, a method for comparing weighted histograms was
developed in paper [?].

In this note we propose a method that allows one to estimate the statistical difference
between histograms.

2. Significance of deviations 

In paper [?] several types of significance of deviation (or significance of an enhancement [?])
between two values were considered:

A. significance of deviation between two expected realizations of random variables (for
example, $S_{c12}$ [?]);

B. significance of deviation between the observed value and the expected realization of a random
variable (for example, $S_{cP}$ [?]);

C. significance of deviation between two observed values.

As shown (in particular, in paper [?]), many of these significances follow a distribution close
to the standard normal distribution if both values are taken from the same statistical population.
This property is used here to estimate the statistical difference between two histograms. We
consider the significance of type C in this note.






3. Model 



Let us consider a simple model with two histograms where the random variable in each bin obeys
the normal distribution

$$\varphi(x \mid n_{ik}) = \frac{1}{\sqrt{2\pi}\,\sigma_{ik}} \exp\left(-\frac{(x - n_{ik})^2}{2\sigma_{ik}^2}\right). \qquad (3.1)$$

Here the expected value in bin $i$ is equal to $n_{ik}$ and the variance $\sigma_{ik}^2$ is also
equal to $n_{ik}$; $k$ is the histogram number ($k = 1, 2$).
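We define the significance as

$$\hat{S}_i = \frac{\hat{n}_{i1} - \hat{n}_{i2}}{\sqrt{\hat{\sigma}_{i1}^2 + \hat{\sigma}_{i2}^2}}. \qquad (3.2)$$

Here $\hat{n}_{ik}$ is the observed value in bin $i$ of histogram $k$ and $\hat{\sigma}_{ik}^2 = \hat{n}_{ik}$.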
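
As a minimal sketch (not the ROOT code used in this note), the bin-by-bin significance can be computed as follows. It assumes that Eq. (3.2) is the difference of the two observed counts divided by the square root of the sum of their variance estimates, with each variance estimate equal to the observed count. The function name `significance` and the sample counts are hypothetical.

```python
import math

def significance(n1, n2):
    """Significance of deviation between two observed bin contents.

    Assumes the form (n1 - n2) / sqrt(n1 + n2), i.e. Eq. (3.2) with the
    variance estimates set equal to the observed values.
    """
    variance = n1 + n2
    if variance <= 0:
        raise ValueError("the summed variance estimate must be positive")
    return (n1 - n2) / math.sqrt(variance)

# Hypothetical observed counts in three bins of two histograms.
hist1 = [66, 70, 59]
hist2 = [45, 52, 47]
s = [significance(a, b) for a, b in zip(hist1, hist2)]
print(s)
```

Empty bin pairs are rejected here rather than silently skipped; how to treat them is a choice left open by the note.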

This model can be considered as an approximation of the Poisson distribution by the normal
distribution. So we suppose that the values $\hat{n}_{ik}$ ($i = 1, 2, \ldots, M$; $k = 1, 2$) are
the numbers of events that appeared in bin $i$ of histogram $k$. We consider the RMS (root mean
square) of the distribution of the significances

$$RMS = \sqrt{\frac{\sum_{i=1}^{M} (\hat{S}_i - \bar{S})^2}{M}}.$$

Here $\bar{S}$ is the mean value of $\hat{S}_i$. The RMS serves as a "distance measure" between two
histograms. Note that the observed value of the RMS can be converted to a p-value. If the total
number of events $N_1$ in histogram 1 and the total number of events $N_2$ in histogram 2 are
different, then the normalized significance is used:

$$\hat{S}_i(K) = \frac{\hat{n}_{i1} - K \hat{n}_{i2}}{\sqrt{\hat{\sigma}_{i1}^2 + K^2 \hat{\sigma}_{i2}^2}}, \qquad (3.3)$$

where $K = N_1 / N_2$.
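
The two quantities defined above can be sketched in a few lines. This is an illustration under stated assumptions, not the authors' implementation. It assumes the normalized significance is (n1 - K n2) / sqrt(n1 + K^2 n2) with K = N1 / N2, and that the RMS is taken about the mean with M in the denominator. The names `normalized_significance` and `rms` and the sample counts are hypothetical.

```python
import math

def normalized_significance(n1, n2, K):
    """Normalized significance for one bin, assuming Eq. (3.3) has the
    form (n1 - K*n2) / sqrt(n1 + K^2 * n2)."""
    variance = n1 + K * K * n2
    if variance <= 0:
        raise ValueError("the summed variance estimate must be positive")
    return (n1 - K * n2) / math.sqrt(variance)

def rms(values):
    """Root mean square deviation of the values about their mean."""
    mean = sum(values) / len(values)
    return math.sqrt(sum((v - mean) ** 2 for v in values) / len(values))

# Hypothetical histograms with different total numbers of events.
hist1 = [130, 142, 118, 125]
hist2 = [68, 66, 57, 64]
K = sum(hist1) / sum(hist2)  # K = N1 / N2
s = [normalized_significance(a, b, K) for a, b in zip(hist1, hist2)]
print(rms(s))
```

With K = 1 the normalized significance reduces to the plain significance, so the same two functions cover both cases.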

Let us consider several examples. 



4. Examples 

All calculations, Monte Carlo experiments and histogram presentations in this note are performed
using the ROOT code [?]. The number of bins M is equal to 1000. The histograms are obtained from
independent samples.

4.1 Uniform distribution 

Consider the case when the expected value $n_{i1}$ in the first histogram is 66 and the expected
value $n_{i2}$ in the second histogram is 45 for each bin number $i = 1, 2, \ldots, M$. The results
of the Monte Carlo experiment for this example are presented in Fig. 1.

Figure 1. Uniform distributions: the observed values $\hat{n}_{i1}$ in the first histogram (top left), the
observed values $\hat{n}_{i2}$ in the second histogram (bottom left), the observed significances $\hat{S}_i$
bin by bin (top right), and the distribution of observed significances (bottom right).



One can see that the distribution of observed significances is close to a normal distribution with
RMS ~ 1. The average significance is ~ 1.96 rather than 0 because the total numbers of events in
the histograms are different.

Fig. 2 shows the corresponding histograms for normalized significances.

Figure 2. Uniform distributions: the observed normalized significances $\hat{S}_i(K)$ bin by bin (left) and
the distribution of observed normalized significances (right).

The distribution of the observed normalized significances is close to the standard normal distribution.
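
The uniform example can be reproduced approximately with a small Monte Carlo sketch. It is not the ROOT code used for the figures. It samples each bin from the normal model of Section 3 (mean and variance both equal to the expected value) and assumes the plain and normalized significances have the forms (n1 - n2) / sqrt(n1 + n2) and (n1 - K n2) / sqrt(n1 + K^2 n2). The seed and the helper name `mean_and_rms` are arbitrary choices.

```python
import math
import random

M = 1000                    # number of bins, as in the note
rng = random.Random(12345)  # arbitrary seed, only for reproducibility

# Normal model: bin content has mean n and variance n.
hist1 = [rng.gauss(66.0, math.sqrt(66.0)) for _ in range(M)]
hist2 = [rng.gauss(45.0, math.sqrt(45.0)) for _ in range(M)]

def mean_and_rms(values):
    """Return the mean and the RMS deviation about the mean."""
    mean = sum(values) / len(values)
    rms = math.sqrt(sum((v - mean) ** 2 for v in values) / len(values))
    return mean, rms

# Plain significance, variance estimates equal to the observed values.
s_plain = [(a - b) / math.sqrt(a + b) for a, b in zip(hist1, hist2)]

# Normalized significance with K = N1 / N2 from the observed totals.
K = sum(hist1) / sum(hist2)
s_norm = [(a - K * b) / math.sqrt(a + K * K * b) for a, b in zip(hist1, hist2)]

mean_plain, rms_plain = mean_and_rms(s_plain)
mean_norm, rms_norm = mean_and_rms(s_norm)
print(f"plain:      mean = {mean_plain:.2f}, RMS = {rms_plain:.2f}")
print(f"normalized: mean = {mean_norm:.2f}, RMS = {rms_norm:.2f}")
```

Under these assumptions the plain significances should scatter around a mean near 2 with RMS near 1, and the normalized ones around a mean near 0 with RMS near 1.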

4.2 Triangle distribution

Consider the case when the expected values $n_{ik}$ in both histograms are equal to $i$, where $i$
($i = 1, 2, \ldots, M$) is the bin number and $k$ is the histogram number ($k = 1, 2$). This means
that the event rates differ from bin to bin. In this case the distribution of observed
significances is also close to the standard normal distribution (see Fig. 3). Thus histograms whose
expected numbers of events differ from bin to bin still give a distribution of significances
close to the standard normal distribution.




Figure 3. Triangle distributions: the observed values $\hat{n}_{i1}$ in the first histogram (top left), the
observed values $\hat{n}_{i2}$ in the second histogram (bottom left), the observed significances $\hat{S}_i$
bin by bin (top right), and the distribution of observed significances (bottom right).



Suppose the histograms are taken from experiments with different integrated luminosities, i.e.
the total numbers of events in the histograms are different. In this case the observed
significances change from bin to bin (see Fig. 4). Correspondingly, the distribution of
significances has a non-Gaussian shape, in contrast with the previous distribution of significances
(see Fig. 3).



For the normalized significance (Eq. 3.3) we recover the standard normal distribution of
significances (see Fig. 5).
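
The bin dependence can be made explicit with a short worked calculation. As an illustration only, assume the second histogram has the same triangle shape scaled by a constant factor $c \neq 1$ (a hypothetical luminosity ratio; the note does not state its value), so that $n_{i1} = i$ and $n_{i2} = c\,i$. Assume also that the significance is the difference of the bin contents divided by the square root of the summed variance estimates. Replacing the observed values by their expectations gives:

```latex
\hat{S}_i \approx \frac{i - c\,i}{\sqrt{i + c\,i}}
          = \frac{1 - c}{\sqrt{1 + c}}\,\sqrt{i},
\qquad
\hat{S}_i(K) \approx \frac{i - K c\,i}{\sqrt{i + K^2 c\,i}} = 0
\quad \text{for } K = \frac{N_1}{N_2} = \frac{1}{c}.
```

Under these assumptions the typical plain significance drifts like $\sqrt{i}$ across the bins, which smears the overall distribution into a non-Gaussian shape, while the normalization by $K$ removes the drift.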

So, if two histograms are obtained from the same flow of events, then the normalized significance
follows a distribution close to the standard normal distribution. The RMS of the distribution of
significances is a measure of the statistical difference between two histograms and,
correspondingly, between two flows of events. This "distance measure" between two histograms has
a clear interpretation:

• RMS = 0 - histograms are identical;

• RMS ~ 1 - both histograms are obtained (using independent samples) from the same
parent distribution;

• RMS >> 1 - histograms are obtained from different parent distributions.

Figure 4. Triangle distributions: the observed values $\hat{n}_{i1}$ in the first histogram (top left), the
observed values $\hat{n}_{i2}$ in the second histogram (bottom left), the observed significances $\hat{S}_i$
bin by bin (top right), and the distribution of observed significances (bottom right).

Figure 5. Triangle distributions: the observed normalized significances $\hat{S}_i(K)$ bin by bin (left) and
the distribution of observed normalized significances (right).
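
The three regimes of the RMS can be illustrated with a small Monte Carlo sketch. It is a toy under stated assumptions, not the procedure used for the figures. Bins are sampled from a normal model with mean and variance equal to the expected value, the normalized significance is taken as (n1 - K n2) / sqrt(n1 + K^2 n2) with K = N1 / N2, and the flat and step-shaped parent distributions, the seed, and the helper names are hypothetical.

```python
import math
import random

def normalized_rms(hist1, hist2):
    """RMS of the normalized significances of two histograms."""
    K = sum(hist1) / sum(hist2)
    s = [(a - K * b) / math.sqrt(a + K * K * b) for a, b in zip(hist1, hist2)]
    mean = sum(s) / len(s)
    return math.sqrt(sum((v - mean) ** 2 for v in s) / len(s))

def sample(expected, rng):
    """Draw one histogram: each bin is normal with mean n and variance n."""
    return [rng.gauss(n, math.sqrt(n)) for n in expected]

M = 1000
rng = random.Random(2024)  # arbitrary seed, only for reproducibility

flat = [100.0] * M                               # hypothetical parent A
step = [100.0] * (M // 2) + [300.0] * (M // 2)   # hypothetical parent B

h = sample(flat, rng)
rms_identical = normalized_rms(h, h)                              # same histogram twice
rms_same = normalized_rms(sample(flat, rng), sample(flat, rng))   # same parent
rms_diff = normalized_rms(sample(flat, rng), sample(step, rng))   # different parents

print(f"identical: {rms_identical:.2f}")
print(f"same parent: {rms_same:.2f}")
print(f"different parents: {rms_diff:.2f}")
```

How large the RMS must be before two histograms are declared different is not fixed by this sketch; it depends on the p-value one is willing to accept.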